Genome characterisation and comparative analysis of Schaalia dentiphila sp. nov. and its subspecies, S. dentiphila subsp. denticola subsp. nov., from the human oral cavity

Background Schaalia species are primarily found among the oral microbiota of humans and other animals. They have been associated with various infections through their involvement in biofilm formation, modulation of host responses, and interaction with other microorganisms. In this study, two strains previously indicated as Actinomyces spp. were found to be novel members of the genus Schaalia based on their whole genome sequences. Results Whole-genome sequencing revealed both strains with a genome size of 2.3 Mbp and GC contents of 65.5%. Phylogenetics analysis for taxonomic placement revealed strains NCTC 9931 and C24 as distinct species within the genus Schaalia. Overall genome-relatedness indices including digital DNA-DNA hybridization (dDDH), and average nucleotide/amino acid identity (ANI/AAI) confirmed both strains as distinct species, with values below the species boundary thresholds (dDDH < 70%, and ANI and AAI < 95%) when compared to nearest type strain Schaalia odontolytica NCTC 9935 T. Pangenome and orthologous analyses highlighted their differences in gene properties and biological functions compared to existing type strains. Additionally, the identification of genomic islands (GIs) and virulence-associated factors indicated their genetic diversity and potential adaptive capabilities, as well as potential implications for human health. Notably, CRISPR-Cas systems in strain NCTC 9931 underscore its adaptive immune mechanisms compared to strain C24. Conclusions Based on these findings, strain NCTC 9931T (= ATCC 17982T = DSM 43331T = CIP 104728T = CCUG 18309T = NCTC 14978T = CGMCC 1.90328T) represents a novel species, for which the name Schaalia dentiphila subsp. dentiphila sp. nov. subsp. nov. is proposed, while strain C24T (= NCTC 14980T = CGMCC 1.90329T) represents a distinct novel subspecies, for which the name Schaalia dentiphila subsp. denticola. subsp. nov. is proposed. This study enriches our understanding of the genomic diversity of Schaalia species and paves the way for further investigations into their roles in oral health. Significance This research reveals two Schaalia strains, NCTC 9931 T and C24T, as novel entities with distinct genomic features. Expanding the taxonomic framework of the genus Schaalia, this study offers a critical resource for probing the metabolic intricacies and resistance patterns of these bacteria. This work stands as a cornerstone for microbial taxonomy, paving the way for significant advances in clinical diagnostics. Supplementary Information The online version contains supplementary material available at 10.1186/s12866-024-03346-w.


Introduction
Schaalia bacteria are Gram-strain positive and aerobic to facultatively anaerobic bacteria with morphology ranging from straight to slightly curved rods, with certain strains exhibiting branching patterns [1].Members of the genus Schaalia predominantly inhabit the oral microbiota of humans and other animals and have been associated with certain human diseases.For instance, Schaalia odontolytica (formerly known as Actinomyces odontolyticus) is a species linked to oral infections [2,3] and Schaalia turicensis was isolated from a surgical biopsy [4].Schaalia species, along with others, play a pivotal role in maintaining oral health through their ability to form biofilm, modulate host responses, and interact with other microbial residents [3,5].
There are currently 12 species with correct and validly published names within the genus Schaalia (https:// lpsn.dsmz.de/ genus/ schaa lia) [6].However, fewer than ten genome assemblies for type strains of these species are publicly available within the National Center of Biotechnology Information (NCBI) database [7].Additionally, only 66 search results were associated with the "Schaalia" term in the PubMed database (Accessed on April 15, 2024).This limited number of search results can be attributed to the fact that much of the research relevant to Schaalia was previously published under its former name, Actinomyces.However, most of the previous studies have focused on aspects such as biofilm formation and interactions with other bacteria, with less emphasis on the classification and systematic genomic analysis of new species under the Schaalia genus.This historical context of nomenclature change contributes to the apparent scarcity of data specifically labeled under the name Schaalia, highlighting the need for updated research and classification in this area.
In our comparative genomics study on oral Actinomyces, a few strains that were initially classified as Actinomyces showed significant genomic differences from typical Actinomyces strains, in particular strains previously reported as Actinomyces odontolyticus NCTC 9931 and Actinomyces odontolyticus C24 [8].Strain C24, originally isolated from root surface caries of supragingival plaque in Papua New Guinea indigenes [9], was transferred to Newcastle University in 1993 by Prof. Dr. Roy R.B. Russell from the University of Sydney, Australia, and strain NCTC 9931 was used as one of the reference strains.It is important to note that biochemical tests within the API system offer only tentative species identification and whole genome sequencing can provide better taxonomic identity with comprehensive genomic profile.
In this study, strain NCTC 9931 was found to represent a novel Schaalia species, for which the name Schaalia dentiphila sp.nov. is proposed.Additionally, strain C24 which showed highly similar yet distinct genomic properties when compared to strain NCTC 9931, is proposed as a subspecies of Schaalia dentiphila.This research provides valuable insights into the diversity and genome characteristics of this important bacterial genus associated with human oral health and sets the stage for further investigations.

Strains preparation
Strains are currently maintained in the laboratory of Prof. Nicholas S. Jakubovics at Newcastle University, United Kingdom, and their subcultures have been deposited in the collections of the National Collection of Type Cultures (NCTC), UK, and China General Microbiological Culture Collection Center (CGMCC), China.For routine cultivation, the strains were grown in BHYE broth, composed of 37 g/L of brain heart infusion and 5 g/L of yeast extract, under an anaerobic atmosphere at 37 °C.For long-term storage, strains were preserved at -80 °C in a broth solution containing 40% (v/v) glycerol.

Genomic DNA extraction and library preparation for sequencing
The MasterPure ™ Gram Positive DNA Purification Kit was used to extract the total DNA from bacterial cells following the manufacturer's instructions.The quantification of DNA was determined using a NanoDrop spectrophotometer (Thermo Scientific), and DNA integrity and purity were verified using 1% (w/v) agarose gel electrophoresis.The genomic DNA (1 μg) was used for library construction by random fragmentation of the DNA using Covaris S2 for 120 s at 5.5-6.0 °C.The DNA fragments were then selected with an average size of 200-400 base pairs using magnetic beads and quantified using a Qubit fluorometer (Thermo Fisher Scientific).These fragments underwent end-repair and 3' adenylation using T4 DNA polymerase, Klenow DNA polymerase, and T4 polynucleotide kinase.Adapter sequences were subsequently ligated to the blunt ends of the 3' adenylated fragments.The ligated products were amplified by PCR.Following purification, the double-stranded PCR products were heat-denatured and circularized using the split oligo sequence.The final library was then prepared from the single-strand circular DNA and sequenced on the Illumina HiSeq X Ten platform with PE151 (pairedend) sequencing, following the manufacturer's recommended protocols (Illumina).The resulting raw sequence data were generated in fastq format.

Data preprocessing, genome assembly and annotation
The generated raw data sequencing underwent preprocessing, which involved filtering out lower quality data including adapter and duplication contamination, using PRINSEQ lite software [10].Bactopia pipeline (v 1.7.1)[11] was employed to assemble clean reads into genomes.The assembled quality was evaluated via the Quality Assessment Tool for Genome Assemblies [12].Additionally, the genome completeness assessment was conducted through the CheckM (v 1.2.2) tool [13].To further verify the genome assembly quality, a more accurate genome assembly assessment online interface, gVolante2 [14] was used based on the BUSCO v5 tool [15].All assembled genomes with high quality were annotated using the Rapid Annotation using Subsystem Technology (RAST) webserver (https:// rast.nmpdr.org) [16].For consistency, publicly available type strain genomes of the Schaalia species with validly published name, were downloaded from the NCBI RefSeq database [17], and subjected to identical annotation using RAST.

16S rRNA phylogeny and multilocus sequence analysis
To determine the preliminary species identification, the 16S rRNA gene sequences extracted from the RAST genome annotation were analysed using the EzBio-Cloud (www.ezbio cloud.net) [18].The 16S rRNA gene phylogeny was inferred with 16S rRNA gene sequences of twelve Schaalia type strains and Actinomyces naeslundii JCM 8349 T from the NCBI database.Similarly, multiple (gene/loci) sequence analyses were conducted according to previously described methods [19].The sequences of five housekeeping genes, including atpA (ATP synthase subunit alpha), rpoB (RNA polymerase, β-subunit), pgi (glucose-6-phosphate isomerase), metG (methionyl-tRNA synthetase), and gyrA (DNA gyrase subunit A) were extracted from RAST genome annotation.These sequences were aligned and concatenated into a continuous sequence for a multilocus sequence analysis [20].
For core genome single-nucleotide polymorphisms (SNPs) analysis, we employed the PGAP (Prokaryotic Genome Annotation Pipeline) pangenomes analysis pipeline.This was used for the extraction and identification of SNPs from each genome sequence [21] with default parameters, except the percentage identity cutoff of the core genome SNPs was 50% and the core genome threshold was confirmed as the number of genomes employed.The core genome SNPs sequences of all genomes were compiled into a single file to construct a robust core genome SNPs phylogenetic tree.
In addition, the autoMLST (Automated Multi-Locus Species Tree) system was employed to infer the phylogeny of strains NCTC 9931 and C24, along with seven available Schaalia type strain genomes.Actinomyces naeslundii NCTC 10301 T was used as an outgroup.The analysis was based on core genes from genomes using the concatenated gene tree method under the default settings [22].
In the above analyses, the MEGA X (v 10.2.6) software was used to align all sequences using the MUSCLE algorithm with default settings.The maximum likelihood (ML) tree was inferred based on the Kimura two-parameter model with 1,000 bootstrap replicates.Additionally, two more 16S rRNA gene-based phylogenetic trees were built using the neighbor-joining and maximum parsimony methods in MEGA X with 1,000 bootstraps [23,24].

Overall genome relatedness index
The overall genome relatedness index was calculated to validate the species delineation based on average nucleotide identity (ANI), average amino acid identity (AAI), and digital DNA-DNA hybridization (dDDH) values.Genomes of strains NCTC 9931 and C24, along with available reference genomes for Schaalia type strains and Schaalia odontolytica non-type strains, as well as the outgroup Actinomyces naeslundii NCTC 10301 T and Actinomyces israelii DSM 43320 T , were retrieved from the NCBI database.Digital DNA-DNA hybridization (dDDH) values were determined using the Genome-to-Genome Distance Calculator (GGDC 3.0) based on the Genome BLAST Distance Phylogeny (GBDP) d 4 formula using the Type Stain Genome Server (TYGS) [25].The dDDH threshold value is less than 70% for species delineation and below 79% for subspecies delineation [26,27].The dDDH results were visualised in a heatmap generated with the GSP 6.0 Heatmap Illustrator (Heml 2.0) tool [28].
For the ANI and AAI analyses, we computed all-vsall distances among all uploaded genome and protein sequences (from RAST annotation).The calculation was based on both one-way ANI and AAI (best hits) and two-way ANI and AAI (reciprocal best hits) metrics using the online tool developed by Kostas lab with the recommended parameters [26,29].The threshold values for species delineation below 95-96% similarity for ANI and AAI, respectively.The threshold value above 96% and up to 99.5% similarity, indicates that the strains are the same species but potentially different subspecies [19,30,31].

Pangenome and orthologous comparative analysis
To understand the genetic composition of species, encompassing the core, accessory, and unique genomes, two additional strains (Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T ) that were phylogenetically closest to strains NCTC 9931 and C24 were selected based on the autoMLST genome tree.We employed the Roary pipeline (Galaxy Version 3.13.0+ galaxy2) for the pangenome analysis [32].The genomes of four strains were annotated using the Prokka Prokaryotic genome annotation tool (Galaxy Version 1.14.6 + galaxy1), keeping job resource parameters to their default settings [33,34].This step generated gff3 files for each genome.These gff3 files were then input into the Roary pipeline for pangenome analysis.
Within the pipeline, we set our criteria such that the minimum percentage protein sequence identity for blastp was fixed at 95%.Any gene present in at least 99% of the isolates was designated as part of the core genome, while all other parameters remained at their default settings.For visual representation of the pangenome analysis results, we utilized the BioVenn online tool [35].For a deeper understanding of core and pan-genome functional annotations (COGs: Clusters of Orthologous Groups and KEGG: Kyoto Encyclopedia of Genes and Genomes) and comprehensive comparisons, we further employed the Bacterial Pangenome Analysis Pipeline (BPGA), known for its rapid analysis capabilities [36].Similarly, the protein sequences of these selected strains obtained from RAST annotation were clustered based on the USEARCH Clustering Algorithm with protein sequence identity cut-off at 95%.Furthermore, we used the Orthovenn3 web server to perform a genome comparison analysis of strains NCTC 9931 and C24 against the selected genomes of Schaalia type strain [37,38].Specifically, the protein sequences of these selected strains obtained from RAST annotation were uploaded to Orthovenn3 to perform an analysis using OrthoFinder algorithm [39] with parameter settings, inflation value at 2 and E-value at 1e-5.

Genomic islands (GIs) prediction and pathogenicity assessment
To investigate the presence of genomic islands (GIs), Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T were selected for comparison based on their phylogenetic proximity in the genome tree.We utilized the RAST annotated gbk files of these strains and predicted genomic islands (GIs) through the IslandViewer web server following the approaches described previously [40].For alignment purposes, Schaalia odontolytica XH001 served as the reference genome.We reordered the genome contigs of our strains relative to this reference using the Mauve tool [41].The reordered genomic sequences, along with the identified GI locations, were downloaded for further visualization.To estimate bacterial pathogenicity, assembled fasta files of our strains were evaluated using the PathogenFinder 1.1 online platform with Actinobacteria as the organism class for a tailored analysis [42].

Identification of virulence factor and CRISPR-Cas system analysis
The genomes of two strains were also examined alongside the type strain genomes of the Schaalia genus, specifically Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T .Our primary tool for virulence factor detection was the Virulence Factor Database (VFDB) [43,44].Utilizing this database, we aligned the nucleotide sequences of our strains against a dataset of 32,672 nucleotide sequences (retrieved on 20 August 2023) from VFDB's comprehensive set (setB), adhering to the default parameters.We considered hits exhibiting over 70% identity as significant based on the criteria reported [45].To delve into the CRISPR-Cas system within our strains, we employed the CRISPR-Cas + + online tool [46].We uploaded the genome sequences of strains NCTC 9931 and C24, Schaalia odontolytica NCTC 9935 T , and Schaalia meyeri DSM 20733 T to this platform.Using the CRISPR-CasFinder program, we pinpointed both the CRISPRs and the associated Cas genes within these genomes.

Genome-based taxonomic description
To characterize the novel species, we utilized the Protologger (Galaxy Version 1.0.0) to extract additional taxonomic, functional, and ecological features [47].The 16S rRNA gene sequences and assembled genome sequences of two novel strains were submitted to the Protologger platform via its online portal (http:// www.proto logger.de/).The formal description was included in compliance with the requirement of the International Code of Nomenclature of Prokaryotes (2022 Revision) [48].

Genome profiles of novel species
The sequencing generated 1,469 Mb of raw data for both strains.After removing low-quality data, 1,183 Mb and 1,168 Mb of high-quality data were obtained for strain NCTC 9931 and C24, respectively (Table 1).
The refined data for strain NCTC 9931, comprising 9,796,674 reads, was assembled into 36 contigs with an N50 value of 230,219 bp.This resulted in a draft genome of 2,374,847 bp with a GC content of 65.5%.A total of 9,796,674 reads for strain C24 were assembled into 22 contigs, yielding an N50 value of 155,844 bp, and the genome size of the draft genome was 2,345,519 bp with a similar GC content of 65.5%.The genome assembly quality is more than 97% for both strains, with 100% genome completeness and contamination below 0.5%, suggesting the high quality of the two assemblies (Table 1).
Subsequent RAST annotations revealed 2,102 coding sequences (CDSs) and 52 RNAs in strain NCTC 9931, while strain C24 harbored 2,069 CDSs and 51 RNAs (Table 1).Interestingly, approximately 28% of CDSs in each strain, 584 in strain NCTC 9931 and 576 in strain C24, matched with subsystem features classified into 23 distinct categories.However, a substantial proportion, over 72% of CDSs in both strains remained uncharacterized in any specific categories (Figure S1A-B).
Among the identified categories, the subsystem of protein metabolism stood out prominently in both strains, accounting for 142 CDSs in strain NCTC 9931 and 144 CDSs in strain C24.This dominance suggests that the synthesis, modification, and degradation of proteins are vital processes for these bacteria, potentially reflecting their capacity for protein metabolism and their adaptation to independent growth in the oral cavity, where proteins are often a key source of nutrients.Following protein metabolism, the subsystems of amino acids and derivatives (138 CDSs for NCTC 9931, 139 CDSs for C24), carbohydrates (109 CDSs for NCTC 9931, 106 CDSs for C24), and cofactors, vitamins, prosthetic groups, and pigments (92 CDSs for both strains) were also well-represented (Figure S1A-B).These subsystems underscore a broad metabolic repertoire, indicating that strains NCTC 9931 and C24 may possess metabolic versatility beneficial for survival in oral ecosystems.
A noteworthy observation is the presence of genes associated with virulence, disease, and defense, with strain NCTC 9931 having 22 CDSs and strain C24 possessing 21 CDSs in this category (Figure S1A-B).This hints at potential mechanisms for host interaction, suggesting that these strains might interact with hosts either as symbionts or opportunistic pathogens.

Phylogenetic relationship analysis
The complete 16S rRNA gene sequences of strain NCTC 9931 and strain C24 were obtained from RAST genome annotation.For strain identification, these sequences underwent analysis using the 16S-based identification tool from EzBioCloud.As shown in Fig. 1A, both strains exhibited high sequence similarities to Schaalia odontolytica CCUG 20536 T (= NCTC 9935 T ), with identities of 99.57% and 99.08%, respectively, suggesting a close phylogenetic relationship with this species.Table 2 provides information on species and accession numbers.
To further elucidate their phylogenetic positions, we downloaded the 16S rRNA gene sequences of twelve Schaalia type species, along with the outgroup reference type strain Actinomyces naeslundii JCM 8349 T from the NCBI database to construct a phylogenetic tree.As shown in Fig. 1B and Figure S1C-D, strain C24 forms a distinct branch on the phylogenetic tree, highlighting potential evolutionary divergence.Whereas strain NCTC 9931, despite sharing the same branch with Schaalia odontolytica CCUG 20536 T (= NCTC 9935 T ), clearly shows distance based on the branch length to their common ancestor.This indicates that even in cases of high 16S rRNA sequence similarity, other aspects of genetic analysis can reveal important evolutionary differences.Further analysis is necessary to validate these findings and possibly explore additional genetic markers to clarify these phylogenetic relationships.
For further granularity, concatenated sequences of five housekeeping genes (atpA, rpoB, pgi, metG, and gyrA) were compared against sequences from seven available Schaalia type strains.The resulting maximum-likelihood tree based on these housekeeping gene sequences revealed that strains NCTC 9931 and C24 were closely related to each other but distinct from Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T (Figure S2A).While these observations provide significant insights, these phylogenetic inferences were inconclusive in determining the taxonomic positions of the strains NCTC 9931 and C24.Core genome SNPs sequences, extracted from the genomes of strains NCTC 9931 and C24 and juxtaposed against their reference type strains, formed the foundation for another phylogenetic tree (Figure S2B).The parallels observed between the results of the core genomes SNPs analysis and the housekeeping gene analysis underpin the consistency in these analytical methodologies.However, while the core genome SNPs analysis largely corroborates the housekeeping gene results, inherent genome and gene variations across species can impact the perceived taxonomic relationships, emphasizing the importance of employing a multifaceted analysis approach for robust species classification.

Genome relatedness analysis
Whole-genome analysis provides maximal resolution, capturing minute variations between microbes that might be overlooked in other methods [49].Despite analyses such as 16S rRNA, housekeeping genes, and core genome SNPs, there remains a compelling need for whole-genome phylogenetic analysis to acquire the most detailed and accurate evolutionary relationships, offering robust evidence for species classification and identification.Strains NCTC 9931 and C24 form the independent branch, distinguishing them from the closely related species Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T (Fig. 1C and Table S1).This indicates that the strains likely represent novel species or distinct subspecies within the genus Schaalia.The multigene sequence analysis via autoMLST aligned perfectly with the core genome SNPs analysis results, reinforcing the reliability of these methodologies (Fig. 1C and Figure S2B).
The DNA-DNA hybridization (DDH) has been the gold standard in bacterial species delineation, and the digital DNA-DNA hybridization (dDDH) version offers a highresolution tool for genome-based taxonomy, allowing for precise species demarcation in the genomic era [50][51][52].To further confirm the species identities of the analysed strains, we conducted a genome comparison analysis between strains NCTC 9931 and C24 with available reference type strains using the Type (strain) Genome Server (TYGS).Neither strain NCTC 9931 nor C24 displayed dDDH values greater than 70% to any of the existing reference type strains (Fig. 1D).Interestingly, the dDDH value between these two strains stood at 75%, indicating both strains as the same species but distinct subspecies within the genus Schaalia.
A comparative analysis of all available Schaalia odontolytica genomes from the NCBI database.TYGS analysis revealed that none of the fifteen Schaalia odontolytica genomes (derived from three strains and twelve metagenome-assembled genomes) shared species-level identity with strains NCTC 9931 and C24 (referenced in Figure S2C and Table S2).However, Schaalia odontolytica ATCC 17982 showed a 100% dDDH value with strain NCTC 9931, indicating that they are the same species (Figure S2D), Based on the records of culture collections, strain ATCC 17982 was a subculture of strain NCTC 9931 (https:// doi.org/ 10. 13145/ bacdi ve156.20230 509.8.1).When both strains ATCC 17982 and NCTC 9931 were compared to Schaalia odontolytica NCTC 9935 T , their phylogenetic relationship was distant, with dDDH values below 70%, suggesting that they should not have been classified as Schaalia odontolytica.These observations also suggest that the strains displaying the same species identity to strain NCTC 9931 should be reclassified, to properly reflect their taxonomic status.
Subsequent ANI analyses revealed that neither strains NCTC 9931 nor C24 exhibited ANI values greater than 95% when compared to other available reference type strains.The ANI value between the two strains is 97% (Fig. 2A).This suggests that strains NCTC 9931 and C24 represent novel species within the genus Schaalia and can be further differentiates as distinct subspecies.Additional ANI analysis also showed that strain ATCC 17982 shares significant genomic similarity with strain NCTC 9931 and C24, with ANI values of 100% and 97%, respectively (Figure S3).The 100% ANI value between strains ATCC 17982 and NCTC 9931 confirmed the clonality of both strains [53].In addition, AAI analysis of strains NCTC 9931 and C24 revealed consistent results with ANI analysis (Fig. 2B), further supporting their distinct taxonomic positions within the genus Schaalia.

Pangenome analysis elucidates the genetic landscape and diversification
To comprehensively understand the genetic composition of novel species, encompassing the core, accessory, and unique genomes, we selected Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T that were phylogenetically closest to strains NCTC 9931 and C24 based on the genome tree to perform the pangenome analysis with a blastp sequence identity cutoff of 95% using BPGA and Roary [32,36,54].The corepan genome plot showcased the varying number of gene families as the analysed genomes increased (Fig. 3A).The core genome curve indicates the consistent gene families across all strains, while the pangenome curve represents the cumulative gene families.Mathematical modeling revealed a power-law relationship for the pangenome, given by f(x) = a.x^b with parameters a = 2,040.32and b = 0.73 (Fig. 3A).The open nature of the pangenome, implies ongoing genetic diversification within the Schaalia genus, emphasizing the potential for discovering new genes as more genomes are sequenced.A marked genetic distinctiveness was observed when the BPGA pangenome analysis was conducted using a stringent 95% protein sequence identity cut-off: a core of 147 gene families was conserved across the four strains.This contrasted sharply with the unique genetic footprints of the two novel species and the referenced strains (Fig. 3B).Roary's computation of the pangenome and core genome sizes (5,271 and 165, respectively) resonated well with BPGA's findings (Figure S4).Notably, a side-byside comparison of strains NCTC 9931 and C24 revealed a shared genetic backbone of 1,662 core gene families.However, strain-specific signatures emerged with NCTC 9931 and C24 boasting 436 and 403 unique gene families, respectively (Fig. 3B).This genetic demarcation not only accentuates their distinct biological functionalities but also fortifies their status as unique subspecies.
Delving deeper into the functional landscape of the core, accessory, and unique gene families, we embarked on the COGs and KEGG functional analysis for the quartet of strains.The COGs distribution spotlighted four predominant categories: information storage and processing, metabolism, cellular processes and signaling, and a cohort of poorly characterized genes (Figure S5A).The core genes primarily steered translation, ribosomal structure and biogenesis, transcription, energy production, and carbohydrate metabolism (Fig. 3C).In contrast, accessory and unique genes had a predilection for general function prediction, carbohydrate transport, and amino acid metabolism.
The KEGG distribution, on the other hand, painted a vivid picture of six major categories, with a heavy inclination of core genes towards translation, carbohydrate metabolism, and energy metabolism (Fig. 3D & S5B).The accessory genes showcased a diverse array, from carbohydrate metabolism to amino acid metabolism and membrane transport.Collectively, these findings weave a tale of both shared heritage and individual genetic evolution, underscoring the complexity and diversity harboured within the Schaalia genus.Fig. 2 ANI and AAI analysis.A Matrix with ANI results between two analysed strains and seven Schaalia reference type strains, ANI values were estimated using both best hits (one-way ANI) and reciprocal best hits (two-way ANI) between two genomic datasets.B Matrix with AAI results between two analyzed strains and seven Schaalia reference type strains, AAI values were estimated using both best hits (one-way AAI) and reciprocal best hits (two-way AAI) between two genomic datasets.Actinomyces naeslundii NCTC 10301 T was employed as an outgroup species.The pink-shaded area indicates the values with ANI and AAI > 96%, and the analyzed strains are highlighted in bold in the matrix
Delving deeper, gene family expansions and contractions elucidated evolutionary trajectories.Relative to their common ancestor, strain NCTC 9931 had 11 expanded and 23 contracted gene families, while strain C24 manifested expansions in 21 and contractions in 19 gene families.This substantiates their genetic divergence not only from each other but also from Schaalia meyeri DSM 20733 T and Schaalia odontolytica NCTC 9935 T (Fig. 4B).
Gene Ontology (GO) enrichment analysis of the expanded gene families in strain C24 spotlighted eight principal categories, including cytochrome complex assembly, DNA-templated transcription initiation, and glycolytic Fig. 3 Pangenome analysis by BPGA.A The plot of the number of gene families in the core and pangenome against the number of genomes of four selected strains.B Venn diagram showing the numbers of core, accessory, and unique genes present in four selected strains.C COGs distribution of core, accessory, and unique genes of four selected strains.The distribution highlights the functional roles of genes within each category.D Distribution of KEGG pathway classification in core, accessory, and unique genomes in the four strains.The four selected strains are NCTC 9931, C24, Schaalia odontolytica NCTC 9935 T , and Schaalia meyeri DSM 20733 T processes, among others (Fig. 4C & Figure S6A).Strain NCTC 9931's expanded gene set, on the other hand, was associated with processes like glutamine metabolism, inositol catabolism, and transmembrane transport (Fig. 4D & Figure S6B).The contracted gene families in strain C24 were involved in 22 biological processes and two cellular components, while those in NCTC 9931 were mapped to 12 biological processes and a singular molecular function (Figure S6C & S6D).
These differential GO enrichments fortify the taxonomical distinctions between strain NCTC 9931 and strain C24.They also suggest distinct adaptive and evolutionary pathways, perhaps indicative of varied ecological niches or functional roles these strains might play in their respective environments.

Genomic features and characterisation
To gain deeper insights into the properties and functions of two novel species, we conducted genomic island (GIs) analysis by comparing them to the closest two reference strains Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T .Our analysis revealed that strains NCTC 9931 and C24 harbored 15 and 13 putative GIs, respectively, while Schaalia odontolytica NCTC 9935 T has 20 putative GIs and Schaalia meyeri DSM 20733 T possess 12 GIs (Figure S7).Among the 15 GIs identified in the assembled genome of strain NCTC 9931, the largest GI was 35,492 bp, containing 30 genes, while the second largest GI has a size of 25,997 bp, consisting of 23 genes (Table S4).For strain C24, the largest GIs were 21,984 bp, with 20 genes.In contrast, the top GIs of Schaalia odontolytica NCTC 9935 T were 28,369 Fig. 4 Orthologous cluster identification and comparative analysis.A Representation of orthologous gene clusters across four selected strains.The Venn diagram illustrates the unique and shared orthologous gene clusters, complemented by a bar chart that quantitatively details the number of clusters for each strain.B Analysis of gene family evolution in the four strains, highlighting contracted and expanded gene families.The pie chart visually contrasts the number of contracted (in purple) versus expanded (in blue) gene families, displaying the evolution of gene families and differences between species.C GO enrichment analysis for 21 expanded gene families matched eight GO categories in strain C24.D GO enrichment analysis for 11 expanded gene families matched eight GO categories in strain NCTC 9931.The four strains analysed are NCTC 9931, C24, Schaalia odontolytica NCTC 9935 T , and Schaalia meyeri DSM 20733 T bp and 27,615 bp, encompassing 24 and 33 genes, individually (Table S4).The top GIs of Schaalia meyeri DSM 20733 T were 19,739 bp with 22 genes.GIs are gene clusters predicted to have integrated into the genome through horizontal transfer and often contain functions related to adaptation to the environment [19].These clear differences in GIs between strains NCTC 9931 and C24 compared to the reference Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T provide further evidence for classification as novel species.
All four strains share key genes associated with stress survival (sodA), regulation (relA, sigA/rpoV), and immune modulation (nuoG, rmlA), showcasing common pathogenic mechanisms.Notably, the absence of the exoenzyme-associated gene zmp1 (encoding putative zincdependent metalloprotease-1) in strain NCTC 9931, C24, and Schaalia meyeri DSM 20733 T compared to Schaalia odontolytica NCTC 9935 T implies variations in their interaction with the host, which could impact their infection mechanisms and pathogenicity, and may lead to different pathological processes and clinical manifestations.Additionally, while most strains share common metabolic genes, their composition and number vary (Table 3 and Figure S8), indicating potential differences in nutritional strategies and environmental adaptation.A Venn diagram (Figure S8) visually summarizes these findings, showing the unique and shared virulence factors across the strains.
CRISPR-Cas analysis for novel species identified 9 CRISPRs and 14 Cas genes in strain NCTC 9931, with the largest CRISPR array consisting of 122 spacers (Table S6 & Table S7).In contrast, stain C24 exhibited only five CRISPRs without Cas, and the largest CRISPR array consisted of only four spacers (Table S6 & Table S7).Schaalia odontolytica NCTC 9935 T possessed ten CRISPRs and one Cas gene, while Schaalia meyeri DSM 20733 T also has nine CRISPRs without the Cas gene.Notably, strain NCTC 9931 hosted a greater number of Cas types and subtypes (Table S7), suggesting that strain NCTC 9931 may have significant functional defenses compared to the other three strains.The CRISPR-Cas system serves as an adaptive immune system found in bacteria, providing protection against foreign genetic elements such as bacteriophages and plasmids.CRISPR-Cas analysis suggests that strain NCTC 9931 might have encountered a wide variety of genetic invaders in its environment and evolved a robust defence mechanism against them.A higher number of CRISPR-Cas genes signifies a greater potential for adaptability and survival in challenging environments.

Discussion
In this study, the genomic landscapes of strains NCTC 9931 and C24, as unveiled by whole-genome sequencing, offer pivotal insights into their biology and evolutionary history.The sequencing quality and subsequent annotations affirm the robustness of the assembled genomes.Both strains, despite their close phylogenetic proximity to Schaalia odontolytica, distinguish themselves through various analyses, cementing their status as novel taxa within the genus Schaalia.
While phylogenetic analyses based on 16S rRNA, housekeeping genes, and core genome SNPs can suggest genetic relatedness, they often fall short in definitively characterizing organisms as novel species or subspecies [56,57].A comprehensive genome dissection and comparative analysis provide a more substantial foundation for species identification [52].In this context, dDDH, ANI, and AAI emerge as pivotal metrics.These methodologies fundamentally rely on whole-genome comparisons rather than individual or a handful of genes, making them the gold standard for bacterial species identification, widely accepted across the scientific community [58][59][60].Interestingly, while ANI and AAI thresholds for subspecies demarcation remain debatable, our analysis of strains NCTC 9931 and C24 revealed 97% ANI, suggesting a notable divergence.We argue that this 3% difference could signify subspecies-level variations.Consequently, in our study, values below 99.5% for ANI and AAI were designated as indicative of novel subspecies.This threshold, supported by our comprehensive pangenome and functional characterization, highlights the genomic distinctions between the two strains.The threshold range for subspecies boundary was also supported by previous study cases [19,30,31,61].
It is critical to note that the choice of reference genomes is vital for the classification and nomenclature of new species.Consequent to the International Code For pangenome analysis, the choice of reference organisms is crucial.In our study, the genomes of Schaalia odontolytica NCTC 9935 T and Schaalia meyeri DSM 20733 T , which are evolutionarily closest to strains NCTC 9931 and C24, were selected.Significant disparities between our strains and their closest relatives strengthen their identity as distinct species.However, a consensus on the protein sequence identity threshold for pangenome analysis remains elusive.While many studies default to a 50% blastp threshold [64][65][66][67], our analyses (both BPGA and Roary) employed a stringent 95% blastp threshold [54,68], which ensures that only sequences with very high similarity are deemed homologous.Given the close evolutionary ties and subtle differences expected between NCTC 9931 and C24, a higher threshold may help in discerning highly conserved core genes from accessory and unique genes, minimizing false positives [68,69].
In our orthologous gene cluster comparison using OrthoVenn3, an inflation value of 2 was adopted, yielding consistent results with other tested values (1.5 and 5).It's important to note that the algorithmic differences between the Markov Clustering Algorithm (MCL) in OrthoVenn3 and USEARCH in pangenome tools can lead to varying gene cluster definitions [21,32,70].Therefore, the core genome cluster obtained in pangenome analysis was 147, while the shared orthologous gene clusters were 1471.Despite these discrepancies, the overarching findings from both analyses converged, highlighting the strains' unique genomic signatures.The expansion and contraction of gene families in strains NCTC 9931 and C24 hint at the evolutionary pressures they've faced, such as gene duplication, loss, or horizontal gene transfer.These shifts can influence their survival strategies and interactions in their environments [71].
Our study also unveiled horizontally transferred Genomic Islands (GIs) that often harbor genes essential for survival in specific environments, conferring advantages such as antibiotic resistance or virulence [72,73].Although we have no direct evidence pointing to the pathogenic potential of strains NCTC 9931 and C24, our prediction analysis indicates their potential pathogenic risk.Genes like sodA, crucial for microbial oxidative stress resistance, and nuoG and rmlA, important for immune modulation, were identified.The sodA gene encodes for Mn-SOD in bacteria, which has a critical antioxidative function.Mn-SOD aids bacteria in defending against assaults from the host immune system, particularly from superoxide anions produced by neutrophils.This capability allows bacteria to survive and multiply within the host, thereby amplifying their pathogenicity [74,75].The rmlA gene is responsible for producing the enzyme glucose-1-phosphate thymidylyltransferase.This enzyme plays a crucial role in the biosynthesis pathways of bacterial lipopolysaccharide (LPS) and other surface polysaccharides, and has been found to have a direct impact on the pathogenicity of bacteria [76,77].The presence of these genes underscores the need for careful research into these strains and vigilance regarding their pathogenic potential in oral diseases.
Additionally, the identification of the significant number of CRISPR-Cas elements in strain NCTC 9931 offers insights into its adaptive immune mechanisms against foreign genetic elements.In contrast, the limited system in C24 might indicate its reliance on alternate protective measures or a more stable and shielded environment.The CRISPR-Cas system serves as an adaptive immune mechanism in bacteria and archaea, offering protection against foreign genetic entities such as bacteriophages and plasmids [78,79].Central to the Type I CRISPR-Cas system is the Cas3 protein, which is responsible for the cleavage of target DNA [80].Besides, the Cas1 protein, in tandem with Cas2, plays a pivotal role in spacer acquisition [81].These spacers act as historical records, offering insights into previous encounters with phages and plasmids [82].Given this information, it can be inferred that the NCTC 9931 bacterium, equipped with the CRISPR-Cas system-notably the Type I and Type I-E variants, possesses a sophisticated adaptive immunity mechanism, enabling it to ward off intruding genetic material.
In the broader perspective, understanding Schaalia's interactions in the oral microbiome, its adaptive responses, and its implications in systemic diseases becomes vital in the current research landscape.As antibiotic resistance surges, delving deep into genera like Schaalia becomes increasingly urgent.The genomic blueprints of Schaalia species are repositories of evolutionary tales, metabolic pathways, and potential virulence mechanisms.Tools like dDDH and ANI & AAI have demystified their evolutionary connections and have distinguished them from other genera.As we usher into an advanced phase of microbial research, Schaalia stands as a beacon, emphasizing that every sequenced genome and studied interaction enriches our grasp on the intricate tapestry of life.

Conclusions
Whole-genome sequence analyses indicated the phylogenetic placement of strains, NCTC 9931 and C24, as independent lineages.Furthermore, overall genomerelatedness index analyses confirmed the taxonomic positions of these strains as belonging to the same species but distinct subspecies within the genus Schaalia.The taxonomic placements of the strains were also supported by the pangenome and orthologous gene cluster analyses which also highlighted their differences in gene properties and biological functions.The expansion and contraction of gene families in both strains hint at their evolutionary pressures and functional differences.Furthermore, the presence of CRISPR and Cas in strain NCTC 9931 genes indicated a robust defense mechanism.By providing in-depth genomic analyses on two novel strains, this study lays foundational groundwork for future studies, paving the way for novel insights and breakthroughs in oral bacteria research.
The type strain NCTC 9931 T (= ATCC 17982 T = DSM 43331 T = CIP 104728 T = CCUG 18309 T = NCTC 14978 T = CGMCC 1.90328 T ) was isolated from carious lesions of the human dentine and has a genome size of 2.37 Mbp with a GC content of 65.5%.Description of Schaalia dentiphila subsp.dentiphila subsp.nov.
Utilise mannose as sole carbohydrate source for growth.Reduce nitrate.Smooth brownish colonies with 0.5-1.0mm in diameter with α-haemolysis on Columbia blood agar.Additional phenotypic characteristics as summarized in Table 4.
The type strain NCTC 9931 T (= ATCC 17982 T = DSM 43331 T = CIP 104728 T = CCUG 18309 T = NCTC 14978 T = CGMCC 1.90328 T ) has a genome size of 2.37 Mb with a GC content of 65.5%.
Smooth red colonies with 0.5-1.0mm in diameter with β-haemolysis on Columbia blood agar.Additional phenotypic characteristics as summarized in Table 4.
The type strain C24 T (= NCTC 14980 T = CGMCC 1.90328 T ) was isolated from supragingival plaque-associated root surface caries in Papua New Guinea indigenes and has a genome size of 2.35 Mb with a GC content of 65.5%.

Fig. 1
Fig.1Phylogenetic and genome sequence analysis of Schaalia strains.A Nucleotide identity heatmap illustrating the similarity of 16S rRNA gene sequence of strain NCTC 9931 and strain C24 compared to Schaalia type strains (EzBioCloud database).Sequence similarity was calculated using the 16S-based identification tool.B Maximum-likelihood tree based on 16S rRNA gene sequences of both strains compared to Schaalia type strains.Actinomyces neaslundii JCM 8349 T was employed as an outgroup.Bootstrap value (percentage) was computed based on 1,000 bootstrap replicates, and values with more than 50% are shown.Note that Schaalia odontolytica CCUG 20536 T and NCTC 9935 T are synonyms of the same strain.C autoMLST genome tree showing the relationship of strains NCTC 9931 and C24 to seven available Schaalia type strains.The tree was visualized using MEGA X, and bootstrap values above 50% are displayed on the tree.D Heatmap with dDDH values between two analysed strains and seven Schaalia reference type strains, the dDDH values were calculated based on the confidence interval of formula d 4 in GBDP.Actinomyces naeslundii NCTC 10301 T served as an outgroup.Analysed strains are highlighted in bold in this figure

Table 1
Summary of sequencing data, genome assembly and annotation

Table 2
List of strains with nearly complete 16S rRNA gene sequences used in the phylogenetic analysis.Note that Schaalia odontolytica CCUG 20536 T and NCTC 9935 T are synonyms of the same strain

Table 3
Virulence factors were predicted in four selected strains against the VFDB database